Diversity patterns of the Flora of Greece with applications in R
You can find all of the files on GitHub.
1 Before we start
Before you start downloading data from the internet, you need to create a project in RStudio. The name of the project, as well as its file path, should NOT include the following:
1. Greek letters
2. spaces (replace the spaces with _)
For example, your file path could be: C:/Environmental_Data_Analysis/Geographical_Data.
You also need to install and load all the packages that we are going to use today1, if you have not done that already.
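A minimal sketch of how you might install any packages that are still missing (the package names below are just examples from today's list; `installed.packages()` and `install.packages()` are base R):

```r
## A sketch: install only those packages from the list that are not yet present
pkgs <- c("tidyverse", "rgbif", "raster", "sf", "rinat", "CoordinateCleaner")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing, dependencies = TRUE)
```

Alternatively, `pacman::p_load()` installs and loads packages in one step (see footnote 1).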
2 Online databases
2.1 Install and load packages
We will have to load some libraries first:
## ===========================================================================##
## Load them
## ===========================================================================##
library(tidyverse)
library(rgbif)
library(raster)
library(rgeos)
library(rgdal)
library(rinat)
library(countrycode)
library(devtools)
library(sf)
library(dismo)
library(taxize)
library(data.table)
library(CoordinateCleaner)
library(ggspatial)
library(scales)
library(skimr)
library(plotly)
library(scrubr)
library(mapr)
library(rasterVis)
## ===========================================================================##
2.2 Download vector data for Greece
We can download some vector data in the form of polygons. We can use the online database GADM to download country boundaries.
We will not delve into any details for now - we will do that in the next practical. We need these data for visualization purposes2.
## ===========================================================================##
## Download the data for Greece
## ===========================================================================##
## There are three levels available, you are free to explore them
Greece <- raster::getData("GADM", country = "GRC", level = 3)
## ===========================================================================##
We can easily plot the data we have just downloaded.
## ===========================================================================##
## Plot Greece
## ===========================================================================##
plot(Greece)
It is equally easy to subset those data.
## ===========================================================================##
## First we need to see what is inside the object we created
## ===========================================================================##
Greece
## class       : SpatialPolygonsDataFrame
## features : 326
## extent : 19.37236, 29.6457, 34.80069, 41.74801 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## variables : 16
## names : GID_0, NAME_0, GID_1, NAME_1, NL_NAME_1, GID_2, NAME_2, NL_NAME_2, GID_3, NAME_3, VARNAME_3, NL_NAME_3, TYPE_3, ENGTYPE_3, CC_3, ...
## min values : GRC, Greece, GRC.1_1, Aegean, Άθως, GRC.1.1_1, Athos, Άθως, GRC.1.1.1_1, Abdera, Abelokipi-Menemeni, Άθως, Dímos, Municipality, NA, ...
## max values : GRC, Greece, GRC.8_1, Thessaly and Central Greece, Θεσσαλία και Στερεά Ελλάδα, GRC.8.2_1, West Macedonia, Θεσσαλία, GRC.8.2.9_1, Zografou, Zografos, Θεσσαλονίκης, Dímos, Municipality, NA, ...
We see that our object has a Coordinate Reference System (crs) and that it includes several columns (as indicated by the slot names).
Next, we are going to see how many columns our object includes:
## ===========================================================================##
## Check the names
## ===========================================================================##
names(Greece)
## [1] "GID_0" "NAME_0" "GID_1" "NAME_1" "NL_NAME_1" "GID_2"
## [7] "NAME_2" "NL_NAME_2" "GID_3" "NAME_3" "VARNAME_3" "NL_NAME_3"
## [13] "TYPE_3" "ENGTYPE_3" "CC_3" "HASC_3"
Finally, we are going to create a new spatial object by subsetting Greece. We do so by using one of the columns present in Greece.
## ===========================================================================##
## Subset Greece
## ===========================================================================##
Crete <- subset(Greece, NAME_1 == "Crete")
Crete_sf <- st_as_sf(Crete)
Let's plot Crete now.
## ===========================================================================##
## Plot Crete
## ===========================================================================##
plot(Crete)
2.3 Introduction
As we mentioned earlier, the most basic category of vector data is points. In ecology, points most often represent sampling locations or field observations in general. These observations consist primarily of species presences at a given site. Nowadays, it is fairly easy to obtain species occurrence data through freely available online databases, such as GBIF, BIEN and iNaturalist. Bear in mind that these databases refer to current/recent species occurrences; there are others, such as the Paleobiology and Neotoma databases3, that contain information regarding fossil data. There are other databases as well, such as PREDICTS and COMPADRE, which deal with time series and population matrices, respectively. If you are interested in conservation-related data, then the IUCN is the best place to search. You can find several other databases here.
Today we are going to explore the GBIF and iNaturalist databases, but you are free to explore the others as well. The former contains more than 1.3 billion records, which can be easily accessed for ecological analyses, while the latter holds significantly fewer records and is essentially a citizen-science database. The ease of access and use is slightly counterbalanced by the fact that these databases contain several erroneous occurrences (e.g., misidentifications in the case of iNaturalist, coordinates falling into the sea in the case of GBIF; see Maldonado et al. (2015) for more).
Let’s first download some point data for Petromarula pinnata, one of the six endemic genera of Greece.
Fig 1. The endemic species Petromarula pinnata from Crete - Photo from idr.
2.3.1 iNaturalist
First we will download data from the iNaturalist database. In order to merge these data with those we will download from GBIF, we will have to add a column named countryCode and remove any rows with no geographical information4.
## ===========================================================================##
## Download data and create an additional column
## ===========================================================================##
petromarula_inat <- get_inat_obs(taxon_name = "Petromarula pinnata") %>% mutate(countryCode = "GR")
## ===========================================================================##
## ===========================================================================##
## Remove any rows without coordinates
## ===========================================================================##
petromarula_inat <- petromarula_inat[complete.cases(petromarula_inat[, 5:6]), ]
## Why do we use the fifth and sixth column to remove any empty rows?
## ===========================================================================##
Next, we will have to perform some checks in order to remove any erroneous or suspicious coordinates.
## ===========================================================================##
## Clean the coordinates
## ===========================================================================##
flags_rinat <- clean_coordinates(x = petromarula_inat, lon = "longitude", lat = "latitude",
countries = "countryCode", species = "scientific_name", tests = c("capitals",
"centroids", "equal", "gbif", "zeros", "seas"), seas_ref = Crete)
## ===========================================================================##
## ===========================================================================##
## Keep only those that are OK
## ===========================================================================##
flags_rinats <- flags_rinat %>% mutate(Data = ifelse(.summary == "TRUE", "Clean",
"Flagged"), NAME_1 = "Crete", Dataset = "iNaturalist") %>% dplyr::filter(str_detect(Data,
"Cle")) %>% dplyr::select(scientific_name, longitude, latitude, Dataset)
## ===========================================================================##
You could run the above functions for a species that might interest you.
2.3.2 GBIF
Let’s now download data from GBIF for the same species. The first step is to check that the species is indeed referenced in GBIF. The function name_suggest() from the rgbif package searches for all species names that more or less fit the request. Only the species names from GBIF that exactly match the name of our species are kept.
The previous step identifies the taxonomic reference number of our species. Using this reference is the safest way to request species occurrences from GBIF.
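The handout does not show the code for this name-matching step, so here is a minimal sketch of what it could look like. We assume the key ends up in the object spp_petromarula (the name used in the next code block); the exact filtering on canonicalName is our assumption:

```r
## Sketch (assumption): look up the GBIF taxon key for Petromarula pinnata
suggestions <- rgbif::name_suggest(q = "Petromarula pinnata")$data
## Keep only the suggestions whose canonical name matches exactly
spp_petromarula <- suggestions$key[suggestions$canonicalName == "Petromarula pinnata"][1]
```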
Now let’s download all the occurrences that are present in GBIF and have geographical information (we will download at most 1,000 occurrences).
## ===========================================================================##
## Download data
## ===========================================================================##
sp_occ <- occ_search(taxonKey = spp_petromarula, country = "GR", hasCoordinate = T,
limit = 1000)
## ===========================================================================##
## ===========================================================================##
## To prevent any problems with file paths, it is good practice to remove
## blank spaces from species names
## ===========================================================================##
sp_occ$data$name <- sub(" ", "_", sp_occ$data$name)
## ===========================================================================##
## ===========================================================================##
## Total number of occurrences
## ===========================================================================##
sort(table(sp_occ$data$name), decreasing = T)
##
## Petromarula_pinnata A.DC. Phyteuma_pinnatum L.
## 116 1
It is not unusual for GBIF to contain (at least some) data from iNaturalist, so we will remove any occurrences that come from the latter database.
## ===========================================================================##
## Remove iNaturalist occurrences
## ===========================================================================##
sp_occ_rem <- sp_occ$data %>% filter(!str_detect(references, "inaturalist"))
## ===========================================================================##
## ===========================================================================##
## How many occurrences in GBIF are also in iNaturalist?
## ===========================================================================##
sp_occ$data %>% filter(str_detect(references, "inatu")) %>% NROW()
## [1] 26
Next, we will again have to perform some checks in order to remove any erroneous or suspicious coordinates.
## ===========================================================================##
## Clean the coordinates
## ===========================================================================##
flags_gbif <- clean_coordinates(x = sp_occ_rem, lon = "decimalLongitude", lat = "decimalLatitude",
countries = "countryCode", species = "species", tests = c("capitals", "centroids",
"equal", "gbif", "zeros", "seas"), seas_ref = Crete)
## ===========================================================================##
## ===========================================================================##
## Keep only those that are OK
## ===========================================================================##
flags_gbif <- flags_gbif %>% mutate(Data = ifelse(.summary == "TRUE", "Clean", "Flagged"),
NAME_1 = "Crete", Dataset = "GBIF", longitude = decimalLongitude, latitude = decimalLatitude,
scientific_name = "Petromarula pinnata") %>% dplyr::filter(str_detect(Data, "Cle")) %>%
dplyr::select(scientific_name, longitude, latitude, Dataset)
## ===========================================================================##
You could run the above functions for a species that might interest you.
2.3.3 Join iNaturalist and GBIF data
We are now in a position to join both datasets and convert them into a spatial object.
flags_both <- full_join(flags_gbif, flags_rinats) %>% st_as_sf(., coords = c("longitude",
"latitude"), crs = 4326)
Let’s visualise the result.
## ===========================================================================##
## First convert the Dataset column to a factor
## ===========================================================================##
flags_both$Dataset <- factor(flags_both$Dataset)
## ===========================================================================##
## ===========================================================================##
## Then add the centroids of each Cretan county
## ===========================================================================##
Crete_sf <- cbind(Crete_sf, st_coordinates(st_centroid(Crete_sf)))
## ===========================================================================##
## ===========================================================================##
## Then do the plotting
## ===========================================================================##
ggplot(Crete_sf) + geom_sf(fill = "antiquewhite1") + geom_sf() + # geom_label(data = Crete_sf, aes(X, Y, label = NAME_3), size = 2, fontface =
# 'bold') +
geom_sf(data = Crete_sf, fill = NA) + geom_sf(data = flags_both, size = 2, aes(color = Dataset,
fill = Dataset, shape = Dataset)) + annotation_scale(location = "bl", width_hint = 0.4) +
annotation_north_arrow(location = "bl", which_north = "true", pad_x = unit(0.75,
"in"), pad_y = unit(0.5, "in"), style = north_arrow_fancy_orienteering) +
coord_sf(xlim = c(23.4, 26.4), ylim = c(34.7, 35.7), expand = FALSE) + xlab("Longitude") +
ylab("Latitude") + ggtitle("Occurrences available online", subtitle = expression(italic("Petromarula pinnata"))) +
theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed",
size = 0.5), panel.background = element_rect(fill = "aliceblue"))
2.3.4 Download GBIF and iNaturalist data for a specified polygon
We can also download species occurrence data from both databases by providing a polygon that we are interested in. Let’s see how that works for iNaturalist and GBIF. We shall do it first for the GBIF database.
## ============================================================================##
## First, create a vector of the species you are interested in
## ============================================================================##
splist <- c("Centaurea idaea", "Petromarula pinnata", "Dianthus juniperinus")
## ============================================================================##
## ============================================================================##
## Download data for them
## ============================================================================##
keys <- sapply(splist, function(x) name_suggest(x)$data$key[1], USE.NAMES = FALSE)
## ============================================================================##
## ============================================================================##
## Create a WKT string for the download
## ============================================================================##
ex <- raster::extent(Crete)
ex <- as(ex, "SpatialPolygons")
ex <- rgeos::writeWKT(ex)
## It seems that there is a problem with the number of digits, so we trimmed them
ex2 <- "POLYGON((23.47347 34.80068, 23.47347 35.69597, 26.35347 35.6959, 26.35347 34.80068, 23.47347 34.80068))"
## ============================================================================##
## ============================================================================##
## Download the data for these species from GBIF ----
## ============================================================================##
dat_ne <- occ_search(taxonKey = keys, hasCoordinate = T, limit = 1000, geometry = ex2)
## ============================================================================##
## ============================================================================##
## Convert it to a dataframe ----
## ============================================================================##
for (i in seq_along(dat_ne)) {
dat_ne[[i]] <- dat_ne[[i]]$data
}
dat_ne <- lapply(dat_ne, "as.data.frame")
dat_ne <- bind_rows(dat_ne)
## ============================================================================##
Try running the above functions for some species that might be present in Epirus (e.g., Ophrys helenae).
And then, in a similar manner for the iNaturalist database.
## ============================================================================##
## First, create a vector of the species you are interested in
## ============================================================================##
gen_list <- c("Centaurea idaea", "Petromarula pinnata")
## ============================================================================##
## ============================================================================##
## Create the desired extent
## ============================================================================##
## We are interested in Crete
extent(Crete)
## class      : Extent
## xmin : 23.47347
## xmax : 26.35347
## ymin : 34.80069
## ymax : 35.69597
bounds <- c(34.80069, 23.47347, 35.69597, 26.35347)
## ============================================================================##
## ============================================================================##
## Create an empty list to store them and run a for loop
## ============================================================================##
plants <- list()
for (i in seq_along(gen_list)) {
plants[[i]] <- get_inat_obs(taxon_name = gen_list[[i]], bounds = bounds)
}
## ============================================================================##
## ============================================================================##
## Convert the list to a dataframe
## ============================================================================##
rinat_df <- do.call(rbind.data.frame, plants)
## ============================================================================##
2.3.5 Download GBIF data for a specific country
Finally, we can download all the plant species that occur in a country (or a specified polygon as exemplified earlier) and are present in GBIF.
##============================================================================##
## Use the name_suggest function to get the gbif taxon key
##============================================================================##
tax_key <- name_suggest(q = "Magnoliopsida", rank = "Class")$data
##============================================================================##
##============================================================================##
## Sometimes groups have multiple taxon keys, so we will
## check how many records are available for them
##============================================================================##
lapply(tax_key$key, "occ_count")
##============================================================================##
## The first one is relevant
##============================================================================##
tax_key <- tax_key$key[1]
##============================================================================##
##============================================================================##
## Let's download some plant species for Crete
##============================================================================##
data_gr <- occ_search(taxonKey = tax_key,
hasCoordinate = T, ## We only want data with coordinates
geometry = ex2, ## Here we define our study area polygon
limit = 3000) ## We only want 3000 occurrences
##============================================================================##
##============================================================================##
## And keep only the actual data
##============================================================================##
data_gr <- data_gr$data
##============================================================================##
2.3.6 Download GBIF data for an entire country
Even though we can download many species occurrences from within R, there is a limit to how many of these occurrences we can actually download.
So, we have downloaded all the plant occurrences for Italy5. Let’s now load and start exploring them.
This can be a lengthy process depending on your PC’s specifications. You can load both the raw and the cleaned data with the following commands:
data_it <- readRDS('Data/RDS/Italian data.rds')
flags_gbif <- readRDS('Data/RDS/Italian data cleaned.rds')
## ============================================================================##
## Load the Italian data
## ============================================================================##
data_it <- fread("I:/SDM DATA/GBIF/PLANTS/Italy/0104713-200613084148143.csv") %>%
setDF() %>% as_tibble()
## ============================================================================##
## ============================================================================##
## Clean them
## ============================================================================##
flags_gbif <- clean_coordinates(x = data_it, lon = "decimalLongitude", lat = "decimalLatitude",
countries = "countryCode", species = "species", tests = c("capitals", "centroids",
"equal", "gbif", "zeros", "seas")) %>% dplyr::filter(.summary == "TRUE" &
coordinateUncertaintyInMeters <= 5000) %>% dplyr::select(decimalLongitude, decimalLatitude,
institutionCode, taxonRank, class, order, family, genus, scientificName)
## ============================================================================##
## Save them for future reference
## ============================================================================##
saveRDS(data_it, "Italian data.rds")
saveRDS(flags_gbif, "Italian data cleaned.rds")
We can now visualise these data. Let’s see how many occurrences there are for each taxonomic unit (taxonRank), as well as which institutions host these plant taxa (institutionCode).
## ============================================================================##
## Taxonomic rank
## ============================================================================##
data_it %>% dplyr::count(taxonRank)

| taxonRank | n |
|---|---|
| CLASS | 3 |
| FAMILY | 1839 |
| FORM | 1332 |
| GENUS | 14662 |
| ORDER | 8 |
| PHYLUM | 9 |
| SPECIES | 922917 |
| SUBSPECIES | 97262 |
| VARIETY | 7533 |
## ============================================================================##
## Institution code
## ============================================================================##
data_it %>% dplyr::count(institutionCode, sort = TRUE) %>% head()

| institutionCode | n |
|---|---|
|  | 482596 |
| BIO-UNIPI | 308257 |
| iNaturalist | 102451 |
| naturgucker | 33354 |
| INFOFLORA | 32041 |
| TLMF | 15110 |
As a next step, let’s see which Order, which Family, which Genus and which Species have the most occurrences in Italy, according to GBIF.
## ============================================================================##
## Order
## ============================================================================##
data_it %>% dplyr::count(order, sort = TRUE) %>% drop_na(order) %>% filter(n > 5000) %>%
ggplot(aes(x = reorder(order, n), y = n, fill = order)) + geom_bar(stat = "identity",
show.legend = FALSE) + labs(x = "Order", y = "Number of Occurrence Records (observations)") +
coord_flip()
## ============================================================================##
## Family
## ============================================================================##
data_it %>% dplyr::count(family, sort = TRUE) %>% drop_na(family) %>% filter(n >
10000) %>% ggplot(aes(x = reorder(family, n), y = n, fill = family)) + geom_bar(stat = "identity",
show.legend = FALSE) + labs(x = "Family", y = "Number of Occurrence Records (observations)") +
coord_flip()
## ============================================================================##
## Genus
## ============================================================================##
data_it %>% dplyr::count(genus, sort = TRUE) %>% drop_na(genus) %>% filter(n > 4500) %>%
ggplot(aes(x = reorder(genus, n), y = n, fill = genus)) + geom_bar(stat = "identity",
show.legend = FALSE) + labs(x = "Genus", y = "Number of Occurrence Records (observations)") +
coord_flip()
## ============================================================================##
## Species
## ============================================================================##
data_it %>% dplyr::count(scientificName, sort = TRUE) %>% drop_na(scientificName) %>%
filter(n > 2500) %>% ggplot(aes(x = reorder(scientificName, n), y = n, fill = scientificName)) +
geom_bar(stat = "identity", show.legend = FALSE) + labs(x = "scientificName",
y = "Number of Occurrence Records (observations)") + coord_flip()
We can also check the temporal dimension of these data (e.g., when were they collected?):
## ============================================================================##
## Visualise the time-frame
## ============================================================================##
out <- data_it %>% drop_na(year) %>% dplyr::count(year) %>% ggplot(aes(x = year,
y = n, group = 1)) +
## You can change the time limits
xlim(1748, 2020) + geom_line() +
## And the y axis limits
ylim(0, 350000)
ggplotly(out)
Run the above for the cleaned dataset. Do you observe any differences?
Hint: You can use skim(data_it) and skim(flags_gbif) for a quick overview.
Which species of the genus Silene has the most occurrences in Italy?
## ============================================================================##
## The solution
## ============================================================================##
data_it %>% dplyr::count(genus, scientificName, sort = TRUE) %>% drop_na(scientificName) %>%
filter(genus == "Silene") %>% filter(n > 5) %>% ggplot(aes(x = reorder(scientificName,
n), y = n, fill = scientificName)) + geom_bar(stat = "identity", show.legend = FALSE) +
labs(x = "Silene species", y = "Number of Occurrence Records (observations)") +
coord_flip()
2.3.7 Download GBIF data for several species
As mentioned earlier, there is a limit to how many occurrences we can download from within R. Let’s say that we have 150 tree species for which we want to download all of their occurrences from most Mediterranean countries. If you look at GBIF, you’ll see that these occurrences number more than 8 million - that’s quite a lot.
So what can we do? First of all, download and save in your working directory an Excel file with the names of those species.
Then, we are going to match those names with the names present in GBIF:
## ============================================================================##
## Match the names with GBIF -----
## ============================================================================##
gbif_taxon_keys <- readxl::read_excel("Test data for the practical.xlsx") %>%
# use fewer names if you want to just test
pull("Taxon name") %>%
# match names to the GBIF backbone to get taxonkeys
taxize::get_gbifid_(method = "backbone") %>%
# add original name back into data.frame
imap(~.x %>% mutate(original_sciname = .y)) %>%
# combine all data.frames into one
bind_rows() %T>%
# save as side effect for you to inspect if you want
readr::write_tsv(path = "all_matches.tsv") %>%
# get only accepted and matched names
filter(matchtype == "EXACT" & status == "ACCEPTED") %>%
# remove anything that might have matched to a non-plant
filter(kingdom == "Plantae") %>%
# get the gbif taxonkeys
pull(usagekey)
# gbif_taxon_keys should be a long vector like this
# c(2977832,2977901,2977966,2977835,2977863)
## ============================================================================##
As a next step, we are going to create a query, asking the GBIF portal to download these data for us.
##============================================================================##
## State your credentials -----
##============================================================================##
user <- "" # your gbif.org username
pwd <- "" # your gbif.org password
email <- "" # your email
##============================================================================##
##============================================================================##
## Download the data -----
##============================================================================##
occ_download(
pred_in("taxonKey", gbif_taxon_keys),
pred_in("basisOfRecord",
c('PRESERVED_SPECIMEN',
'HUMAN_OBSERVATION',
'OBSERVATION',
'MACHINE_OBSERVATION')),
# pred_gt("elevation", 5000),
pred_in("country",
c('AL',
'HR',
'CY',
'FR',
'GR',
'IT',
'MT',
'ME',
'PT',
'SI',
'ES')),
pred("hasCoordinate", TRUE),
pred("hasGeospatialIssue", FALSE),
# pred_gte("year", 1990),
format = "SIMPLE_CSV",
user = user,
pwd = pwd,
email = email)
##============================================================================##
So now we can download the data we requested6. In the following few lines of code, we will see how we can import these massive data (4.7GB) into R, but this is for demonstration purposes only.
Do not run this code on your PC now, since it may take some time to read the data7. You can read the downloaded file into R using the fread function from the data.table R package:
## ============================================================================##
## Read and save the data -----
## ============================================================================##
gbif_tree_data <- fread("0086467-200613084148143/0086467-200613084148143.csv")
## ============================================================================##References
Maldonado, C., Molina, C.I., Zizka, A., Persson, C., Taylor, C.M., Albán, J., Chilquillo, E., Rønsted, N., & Antonelli, A. (2015) Estimating species diversity and distribution in the era of big data: to what extent can we trust public databases? Global Ecology and Biogeography, 24, 973–984.
There are two ways to do that fast:
1. library(pacman); p_load('package_name_1', 'package_name_2')
2. install.packages(c('package_name_1', 'package_name_2'), dependencies = TRUE)
Remember that if you use the first way, you need to install the package pacman first.↩
You are free to ask whatever you want, however.↩
With no coordinates to be exact↩
You can download this file from the github repository or on your own from GBIF.↩
It needs some time to do that, so we have to be a little bit patient.↩
You can do that after the webinar finishes, if you want.↩